This notebook shows how BigBang can help you explore a mailing list archive.
First, use this IPython magic to tell the notebook to display matplotlib graphics inline. This is a nice way to display results.
In [1]:
%matplotlib inline
Import the BigBang modules as needed. These should be in your Python environment if you've installed BigBang correctly.
In [2]:
import bigbang.mailman as mailman
import bigbang.graph as graph
import bigbang.process as process
from bigbang.parse import get_date
#from bigbang.functions import *
from bigbang.archive import Archive
Also, let's import a number of other dependencies we'll use later.
In [4]:
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import numpy as np
import math
import pytz
import pickle
import os
plt.style.use('ggplot') # pd.options.display.mpl_style was removed from pandas; matplotlib style sheets replace it
Now let's load the data for analysis.
In [4]:
urls = ["",
archives = [Archive(url,archive_dir="../archives") for url in urls]
activities = [arx.get_activity() for arx in archives]
This variable sets the window, in days, used to compute the rolling averages.
In [5]:
window = 100
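To make the role of the window concrete, here is a minimal sketch on synthetic daily counts. The dates and values are made up, `demo_window` is a hypothetical stand-in for `window`, and the sketch assumes a pandas version that provides `Series.rolling`.

```python
import pandas as pd

# Synthetic daily message counts (illustrative only, not real list data).
days = pd.date_range("2014-01-01", periods=10, freq="D")
counts = pd.Series([0, 2, 4, 6, 8, 10, 12, 14, 16, 18], index=days)

demo_window = 3  # hypothetical small window so the effect is easy to see

# Each value is the mean of the current day and the (demo_window - 1) days before it.
rolling = counts.rolling(demo_window).mean()

# The first (demo_window - 1) days have no full window and come back as NaN.
smoothed = rolling.dropna()
print(smoothed)
```

A larger window gives a smoother curve but discards more days at the start of the series.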
For each of the mailing lists we are looking at, plot the rolling average of number of emails sent per day.
In [6]:
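plt.figure(figsize=(12.5, 7.5))
colors = 'rgbkm' # one matplotlib color code per mailing list
for i, activity in enumerate(activities):
    ta = activity.sum(1) # total emails per day, across all senders
    rmta = ta.rolling(window).mean() # replaces the removed pd.rolling_mean(ta, window)
    rmtadna = rmta.dropna() # drop entries that lack a full window
    plt.plot_date(rmtadna.index, rmtadna.values, colors[i % len(colors)],
                  label=mailman.get_list_name(urls[i]) + ' activity', xdate=True)
plt.legend()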
Now, let's see: who are the authors of the most messages to one particular list?
In [7]:
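a = activities[0] # for the first mailing list
ta = a.sum(0).sort_values() # total messages per sender (summed over all dates), ascending
ta[-20:].plot(kind='barh') # the 20 most active senders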
This might be useful for seeing the distribution (does the top message sender dominate?) or for identifying key participants to talk to.
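As a rough illustration of the per-sender totals, the sketch below builds a tiny activity table by hand. It assumes, as the code above suggests, that rows are dates and columns are senders. The addresses and counts are invented.

```python
import pandas as pd

# Hypothetical activity table: one row per day, one column per sender.
activity = pd.DataFrame(
    {
        "alice@example.org": [3, 0, 2],
        "bob@example.org": [1, 1, 0],
        "carol@example.org": [0, 5, 4],
    },
    index=pd.date_range("2014-01-01", periods=3, freq="D"),
)

# Summing over axis 0 collapses the dates and leaves one total per sender.
per_sender = activity.sum(0).sort_values()

# With ascending order, the most active senders sit at the end.
top = per_sender.iloc[-2:]

# Share of all messages written by the single most active sender.
top_sender_share = per_sender.iloc[-1] / per_sender.sum()
print(top)
print(top_sender_share)
```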
Many mailing lists will have some duplicate senders: individuals who use multiple email addresses or are recorded as different senders when using the same email address. We want to identify those potential duplicates in order to get a more accurate representation of the distribution of senders.
To begin with, let's do a naive calculation of the similarity of the From strings, based on the Levenshtein distance.
This can take a long time for a large matrix, so we will truncate it for purposes of demonstration.
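For readers unfamiliar with the metric, here is a plain-Python sketch of Levenshtein (edit) distance and of a pairwise matrix built from it. It is not the `Levenshtein` package or BigBang's own code, only the textbook dynamic-programming recurrence. The sender strings are invented.

```python
def edit_distance(s, t):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, char_s in enumerate(s, start=1):
        current = [i]  # turning s[:i] into the empty string takes i deletions
        for j, char_t in enumerate(t, start=1):
            substitution = previous[j - 1] + (char_s != char_t)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


# Hypothetical From strings; the first two differ only in the top-level domain.
senders = [
    "Jane Doe <jane@example.org>",
    "Jane Doe <jane@example.com>",
    "John Roe <jroe@example.net>",
]

# n senders require n * n comparisons, which is why large lists are slow.
matrix = [[edit_distance(x, y) for y in senders] for x in senders]
for row in matrix:
    print(row)
```

Because the number of comparisons grows with the square of the number of senders, truncating to the first 100 columns keeps the demonstration fast.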
In [9]:
import Levenshtein
distancedf = process.matricize(a.columns[:100], lambda a,b: Levenshtein.distance(a,b)) # calculate the edit distance between the two From titles
df = distancedf.astype(int) # specify that the values in the matrix are integers
In [10]:
fig = plt.figure(figsize=(18, 18))
plt.pcolor(df) # color grid of the pairwise distances
#plt.yticks(np.arange(0.5, len(df.index), 1), df.index) # these lines would show labels, but that gets messy
#plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns)
plt.colorbar()
The dark blue diagonal is comparing an entry to itself (we know the distance is zero in that case), but a few other dark blue patches suggest there are duplicates even using this most naive measure.
Below is a variant of the visualization for inspecting the particular apparent duplicates.
In [11]:
levdf = process.sorted_lev(a) # creates a slightly more nuanced edit distance matrix
# and sorts by rows/columns that have the best candidates
levdf_corner = levdf.iloc[:25,:25] # just take the top 25
In [12]:
fig = plt.figure(figsize=(15, 12))
plt.pcolor(levdf_corner) # color grid of the 25 x 25 corner
plt.yticks(np.arange(0.5, len(levdf_corner.index), 1), levdf_corner.index)
plt.xticks(np.arange(0.5, len(levdf_corner.columns), 1), levdf_corner.columns, rotation='vertical')
plt.colorbar()
For this still-naive measure (edit distance on a normalized string), many pairs with a distance below 10 appear to be genuine duplicates. Above that threshold, the small edit distance between short email addresses at common domain names starts to dominate and produces false matches.
In [13]:
consolidates = []
# gather pairs of names which have a distance of less than 10
for col in levdf.columns:
    for index, value in levdf.loc[levdf[col] < 10, col].items():
        if index != col: # the name shouldn't be a pair for itself
            consolidates.append((col, index))
print(str(len(consolidates)) + ' candidates for consolidation.')
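The thresholding step can be tried on a toy distance matrix. The sketch below uses invented names and distances, and `threshold` and `unique_pairs` are hypothetical names. It also shows that a symmetric matrix yields every pair twice, once in each order.

```python
import pandas as pd

# Hypothetical symmetric distance matrix for three senders.
names = ["jane@example.org", "jane@example.com", "bob@example.net"]
dist = pd.DataFrame(
    [[0, 3, 20],
     [3, 0, 22],
     [20, 22, 0]],
    index=names,
    columns=names,
)

threshold = 10
pairs = []
for col in dist.columns:
    # Rows whose distance to `col` is under the threshold.
    close = dist.loc[dist[col] < threshold, col]
    for index, value in close.items():
        if index != col:  # skip the zero-distance self match
            pairs.append((col, index))

# Collapse (x, y) and (y, x) into a single unordered pair.
unique_pairs = sorted({tuple(sorted(pair)) for pair in pairs})
print(len(pairs), len(unique_pairs))
```

So the candidate count printed above may be roughly twice the number of distinct pairs, depending on how symmetric the matrix is.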
In [14]:
c = process.consolidate_senders_activity(a, consolidates)
print('We removed: ' + str(len(a.columns) - len(c.columns)) + ' columns.')
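To give a feel for what consolidation does to the activity table, here is a sketch that merges the columns of paired senders by summing them. This is an assumption about the general idea, not BigBang's actual `consolidate_senders_activity` implementation. The data, `canonical`, and `find` are hypothetical.

```python
import pandas as pd

# Hypothetical activity table with one sender split across two addresses.
activity = pd.DataFrame({
    "Jane <jane@example.org>": [1, 0, 2],
    "Jane <jane@example.com>": [0, 3, 0],
    "Bob <bob@example.net>": [1, 1, 1],
})
pairs = [("Jane <jane@example.org>", "Jane <jane@example.com>")]

# Start with every sender as its own representative.
canonical = {name: name for name in activity.columns}


def find(name):
    """Follow the links until reaching a sender that represents itself."""
    while canonical[name] != name:
        name = canonical[name]
    return name


# Link the two sides of each pair so chains of duplicates share one representative.
for left, right in pairs:
    root_left, root_right = find(left), find(right)
    if root_left != root_right:
        canonical[root_right] = root_left

# Sum the columns that share a representative.
groups = {name: find(name) for name in activity.columns}
merged = activity.T.groupby(groups).sum().T
print(merged)
```

Following the links rather than using the pairs directly means that if A matches B and B matches C, all three end up in one column.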
We can create the same color plot with the consolidated dataframe to see how the distribution has changed.
In [15]:
lev_c = process.sorted_lev(c)
levc_corner = lev_c.iloc[:25,:25]
fig = plt.figure(figsize=(15, 12))
plt.pcolor(levc_corner) # color grid of the 25 x 25 corner after consolidation
plt.yticks(np.arange(0.5, len(levc_corner.index), 1), levc_corner.index)
plt.xticks(np.arange(0.5, len(levc_corner.columns), 1), levc_corner.columns, rotation='vertical')
plt.colorbar()
Of course, there are still some duplicates, mostly people who are using the same name, but with a different email address at an unrelated domain name.
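One way to surface these remaining cases is to compare display names instead of whole From strings. The sketch below uses the standard library's `email.utils.parseaddr` and assumes senders are written as `Name <address>`, which may not match the format stored in a given archive. The senders are invented.

```python
from collections import defaultdict
from email.utils import parseaddr

# Hypothetical From strings: the same name at two unrelated domains.
senders = [
    "Jane Doe <jane@example.org>",
    "Jane Doe <jdoe@mail.example.net>",
    "Bob Roe <bob@example.com>",
]

# Group addresses under a lowercased display name.
by_name = defaultdict(list)
for sender in senders:
    name, address = parseaddr(sender)
    by_name[name.strip().lower()].append(address)

# Names that appear with more than one address are candidates for review.
same_name = {name: addrs for name, addrs in by_name.items() if len(addrs) > 1}
print(same_name)
```

Matching on names alone will also merge different people who share a name, so the output is best treated as a list to check by hand.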
How does our consolidation affect the graph of distribution of senders?
In [17]:
fig, axes = plt.subplots(nrows=2, figsize=(15, 12))
ta = a.sum(0).sort_values() # total messages per sender, ascending
ta[-20:].plot(kind='barh',ax=axes[0], title='Before consolidation')
tc = c.sum(0).sort_values()
tc[-20:].plot(kind='barh',ax=axes[1], title='After consolidation')
The result is not dramatically different, but the consolidation makes the head of the distribution heavier. More people sit close to the high end, which suggests a stronger core group rather than a distribution that falls off smoothly from one or two dominant senders.
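A simple way to put a number on "heavier head" is the share of all messages sent by the top k senders, before and after consolidation. The sketch below uses invented totals and a hypothetical helper, `top_share`.

```python
import pandas as pd

# Hypothetical per-sender totals; Jane appears under two addresses before merging.
before = pd.Series({
    "bob": 50, "jane@example.org": 30, "jane@example.com": 25,
    "carol": 20, "dan": 10, "erin": 5,
})
after = pd.Series({"jane": 55, "bob": 50, "carol": 20, "dan": 10, "erin": 5})


def top_share(totals, k):
    """Fraction of all messages sent by the k most active senders."""
    ordered = totals.sort_values()
    return ordered.iloc[-k:].sum() / totals.sum()


before_share = top_share(before, 2)
after_share = top_share(after, 2)
print(before_share, after_share)
```

In this toy case the total number of messages is unchanged, but the top two senders account for a larger fraction once the duplicate is merged.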